Abstract

BitTorrent suffers from one fundamental problem: the
long-term availability of content. This occurs on a massive
scale, with 38% of torrents becoming unavailable within
the first month. In this paper we explore this problem
by performing two large-scale measurement studies including
46K torrents and 29M users. The studies go significantly
beyond any previous work by combining per-node,
per-torrent and system-wide observations to ascertain the
causes, characteristics and repercussions of file
unavailability. The studies confirm the conclusion from previous
works that seeders have a significant impact on both
performance and availability. However, we also present some
crucial new findings: (i) the presence of seeders is not the
sole factor involved in file availability, (ii) 23.5% of nodes
that operate in seedless torrents can finish their downloads,
and (iii) BitTorrent availability is discontinuous, operating
in cycles of temporary unavailability. Given these new
findings, we consider it important to revisit the solution
space; to this end, we perform large-scale trace-based
simulations to explore the potential of two abstract approaches.

1 Introduction 

BitTorrent [3] has become a de-facto standard for scalable
content distribution over the Internet. The reason for its
success is its ability to efficiently leverage the uplink capac- 
ity of nodes whilst achieving high scalability during peak 
demands ifTOlfTSl . This efficiency is largely attributable to 
BitTorrent's tit-for-tat mechanism, which encourages users 
to share their resources whilst downloading files. 

Despite the success of BitTorrent, it still suffers from a
significant problem: the long-term availability of content.
More specifically, content that is distributed using
BitTorrent often becomes unavailable after a relatively short
period of time. For example, [7] found that the available
lifespan of most torrents is between 30 and 300 hours whilst 10% of
all users fail to successfully download their desired content.

A file can be considered unavailable if one or more of
its data pieces are inaccessible to users wishing to download
it. The most intuitive reason for this occurrence is that
previously successful users in possession of the entire file
(seeders) have left the system, leaving only users that
possess a subset of the file (leechers). Subsequently,
unavailability occurs when these leechers cannot collectively rebuild
the complete file with their remaining pieces. Previous
research (such as [8][7][13]) has promoted the importance of
seeders in regard to availability and concluded that a
seedless torrent is unable to reconstruct the file. However, this
conclusion is challenged by the observation that some
torrents continue to effectively serve files despite lacking any
seeders.

In this paper, we devote our attention to understanding 
and characterizing BitTorrent's file unavailability problem. 
We strive to discover the scale, causes and repercussions 
of the problem alongside investigating the possible solution 
space. To achieve this we have performed two large-scale
measurement studies. The first investigates BitTorrent on a
macroscopic level by periodically probing over 46K torrents
to ascertain their high-level characteristics, such as swarm
size and seeder/leecher ratio. The second study
investigates BitTorrent on a microscopic level by contacting
over 700,000 individual peers in 832 torrents to discover
relevant properties such as their download rates and piece
availability. To the best of the authors' knowledge, this is
the largest dataset, in terms of size and collected
information, used to investigate file availability in BitTorrent. This
allows us to extend previous works and obtain far more
accurate results; through this we make a number of interesting
findings:

• In 86% of cases, leechers are unable to reconstruct files
in the absence of seeders. However, in 14% of cases,
leechers can reconstruct the file without any seeders
present. We therefore discover that seeders are not the
sole factor involved in BitTorrent's unavailability
problem. Such torrents achieve this through large and
stable populations as well as high aggregate
download rates that enable leechers to quickly
replicate rare chunks.

• In 64% of torrents, unavailability is not immutable
and, instead, occurs in cyclic periods followed by
recurring availability. This is due to old seeders returning
to swarms in which they previously participated.

• The combination of the two previous observations re- 
sults in 23.5% of users affected by a lack of seeders 
actually being able to complete their downloads. 

• Users often become frustrated with unavailable tor- 
rents that exhibit poor download rates. We observe a 
chain reaction in which such users abort their down- 
loads thereby exacerbating unavailability, resulting in 
further abortions. 

These new findings make it crucial to revisit the solution
space to investigate behaviour under the new, accurate
workload defined by our large-scale dataset. As such, we
perform trace-based simulations looking at both traditional
single-torrent and cross-torrent approaches to
solving the file unavailability problem; our primary results
are:

• Single-torrent incentive mechanisms must encourage 
users to increase the average seeding time to 10 times 
more than the current average to achieve 99% avail- 
ability. 

• Cross-torrent incentive mechanisms can easily achieve 
99% availability but with a performance decrease of 
22% for 56% of the users. 

The rest of the paper is structured as follows: Section
2 provides related work. Section 3 then details the problem
and our measurement methodology. Following this, Section
4 characterises the causes and impact of unavailability. We
discover a primary cause is a lack of seeders and therefore
Section 5 investigates seedless states in BitTorrent. Next,
we utilise our measurement study data to explore the
potential solution space with trace-based simulations in Section
6. Finally, we conclude the paper in Section 7.

2 Related Work 

BitTorrent Measurements: BitTorrent measurement 
studies can be classified into two different groups. The first 
type uses log traces from trackers [10] [8] [7] [1] whereas the
second type relies on crawling techniques to retrieve the in- 
formation from the system [[H [HI EH [H El ■ The first 
type of measurements is less intrusive since they do not ac- 
tively interfere with the system. However, they are often 
problematic to obtain since they require the agreement from 
content providers. The crawling techniques, on the other
hand, can be divided into two categories. In its simplest
form, a crawler exploits the BitTorrent protocol to period- 
ically request the IP addresses of the clients participating 
in the torrent from the tracker [21]. This makes it possible
to study the demographics and dynamics of the torrents
under analysis. This is what we name macroscopic crawl- 
ing. More sophisticated crawlers also contact the clients 
and retrieve detailed information such as the client ID and 
their piece bitmap. We name this microscopic crawling. Al- 
though the microscopic crawling gives more detailed infor- 
mation, it is noticeably less scalable and only allows a few 
thousand torrents to be studied in parallel [18, 19]. Each
approach is effective for addressing particular needs; however,
these have not yet been combined to investigate BitTorrent
in a holistic way.

BitTorrent's File Availability Analysis: There are only
a few works investigating availability issues in BitTorrent
systems [7] [15] [13]. Neglia et al. mainly study the
tracker/DHT availability of 22,000 torrents obtained from
two torrent indexing sites [15]. Guo et al. [7] extended
this, to model the lifespan of torrents by analyzing a limited 
number of tracker traces from fTSj; it was found that most 
torrents are short-lived because of an exponentially decreas- 
ing peer arrival rate. This model starts from the basis that 
content is unavailable when there are no seeders present in 
the swarm. This is, so far, an unverified hypothesis that is 
important to investigate. Similarly, Menasche et al. also 
use this hypothesis to investigate the availability of seeders 
in 45,000 torrents obtained from the Mininova website [13],
finding that 40% of swarms lack seeders for more than 15 
days in the first month after the torrent's birth. 

Improving BitTorrent File Availability: Surprisingly
little research has addressed
file availability in BitTorrent [8][13][20]. The most recent
work improves file availability in BitTorrent by bundling
files to extend the online times of the users. Using
a queuing theoretic model and controlled experiments on 
PlanetLab, the authors show that this approach can reduce 
waiting-time for peers in torrents with highly unavailable 
seeders. However, their results consider that peers arrive 
in a constant Poisson process which is a strong assumption 
given the measurement results presented in US. |8J and also 
in this paper. 

Guo et al. were the first to propose intriguing ideas 
and results for cross-torrent collaboration. Amongst other 
things, the authors sketch an abstract mechanism for instant 
inter-torrent collaboration; following this they also evaluate 
the principles. Yang et al. propose a variation of these ideas 
by designing a cross-torrent tit-for-tat strategy that assumes 
repeated interactions of the users. However, this method 
suffers because as Piatek et al. show through extensive mea- 
surements, 91.5% of peer pairs that occur in a single swarm 
will never meet again at any later point in time fTJl . Pi-
atek et al. subsequently propose an alternative protocol that
enables long-term incentives in BitTorrent with the aid of 
one-hop intermediaries. 

3 Problem Background and Methodology 

3.1 Defining File Availability 

To study and understand the availability of files in
BitTorrent we first present a simple model. Assume that
we have a torrent T, formed by N nodes, managing the
download of a file composed of P pieces. We can then
define the vector $V_i = [V_{i1}, V_{i2}, \ldots, V_{iP}]$ that contains the
information about the pieces stored by peer i: $V_{ij} = 1$ if
node i has piece j; $V_{ij} = 0$ if node i does not have piece
j. $V_i$ is typically known as the bitfield of node i.

We define the Percentage of Available Pieces of torrent
T at a time instant t as

$$U(T) = \frac{\sum_{j=1}^{P} OR(V_{ij})}{P} \qquad (1)$$

where $OR(V_{ij})$ represents the logical OR-operation
over the piece j across all the nodes in the torrent T.
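
As a minimal sketch of Eq. 1 (not the authors' crawler code), the following Python function computes U(T) from a list of per-peer bitfields. The function name `available_fraction` and the list-of-lists representation are assumptions made for illustration only.

```python
from typing import Sequence


def available_fraction(bitfields: Sequence[Sequence[int]]) -> float:
    """Return U(T): the fraction of the P pieces held by at least one peer.

    Each inner sequence is one peer's bitfield V_i (1 = has piece, 0 = missing).
    The representation is an illustrative assumption, not the paper's format.
    """
    if not bitfields:
        return 0.0  # no reachable peers, so no piece is observable
    num_pieces = len(bitfields[0])
    if num_pieces == 0:
        raise ValueError("a file must consist of at least one piece")
    # Logical OR over all peers, evaluated piece by piece.
    covered = sum(
        1 for j in range(num_pieces) if any(peer[j] for peer in bitfields)
    )
    return covered / num_pieces


# Three leechers, four pieces: nobody holds the last piece.
example = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
print(available_fraction(example))  # 0.75
```

In this toy swarm U(T) = 0.75, so the file could not be reconstructed from the observed peers alone.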

3.2 The Circumstances of Unavailability 

It is important to understand in which circumstances a 
file becomes unavailable, based on our definition. A file is 
considered unavailable if at least one of its pieces is not ac- 
cessible within a swarm. This situation arises if there are 
no peers in the swarm that possess a given piece or, alterna- 
tively, if the peer(s) that possess the piece are inaccessible 
(e.g. due to firewalls, NAT or overlay graph disconnection). 
It is intuitive to consider the former as a far more likely 
circumstance (e.g. most BitTorrent clients implement tech- 
niques such as NAT traversal fT?]. Moreover, they include 
neighbors discovery techniques such as the Peer Exchange 
Protocol -PEX- and periodical tracker polling that prevent 
graph disconnection). Therefore, given this assumption, a
file can be considered available if (i) there is at least one
seeder or (ii) there is no seeder but the bitfields of the
leechers collectively fit the condition U(T) = 1. Without
detailed analysis, we can therefore currently state that:

• With an accessible seeder, a file is available 

• Without an accessible seeder, a file may be available 

This paper uses these two observations as a starting point
to investigate unavailability in BitTorrent. In the following
sections, we denote time periods in a torrent's lifecycle in
which no seeder is online as a seedless state. Accordingly,
the file is unavailable if torrent T is in a seedless state and
U(T) < 1.



3.3 Measurement Methodology 

To study the unavailability problem and specifically the 
seeders' role in it, we have performed two large-scale mea- 
surement studies using microscopic and macroscopic crawl- 
ing. To the best of our knowledge, this paper is the first to 
combine both microscopic and macroscopic crawling tech- 
niques to better understand BitTorrent (specifically BitTor- 
rent's file availability). 

Microscopic Crawling: To truly understand unavailabil- 
ity in BitTorrent, it is necessary to be able to view the micro- 
scopic characteristics of any given swarm, e.g. piece distri- 
bution or nodes' download rates. Without this, one can only 
get a rough estimation of availability using metrics such as 
the number of seeders. The information regarding the be- 
haviour of individual peers provides the necessary data to 
make new, more accurate findings. To gain this information 
we developed and deployed a distributed BitTorrent crawler 
that can investigate swarms on a microscopic level, using 
20 nodes in the Emulab testbed [5].

The crawler operated from July 18, 2009 to July 29,
2009 (micros-1) and then again from August 19, 2009
to September 5, 2009 (micros-2). To discover all the
online users in a torrent it periodically contacted the torrent's
tracker and also used the Peer Exchange Protocol (PEX).
Every 10 minutes it requested the piece bitmap of every
peer to discover the real-time distribution of pieces. For
the micros-1 study, the crawler followed 255 torrents
appearing on Mininova¹ after the first measurement hour; in
these torrents, we observed 246,750 users. The micros-2
dataset contains information from 577 torrents and 531,089
users.

Macroscopic Crawling: The microscopic measurements 
provide detailed insight into the distribution of pieces and 
download rates within the swarm, as well as between differ- 
ent peers. However, due to scalability issues it is difficult to 
perform such detailed measurements on a very large-scale 
(e.g. several thousand torrents). To complement these re- 
sults we therefore also implemented a higher level crawler 
that followed every torrent published on the Mininova web- 
site after December 09, 2008 for a period of 38 days. This 
crawler periodically requested, from multiple sites in Eu- 
rope, tracker information regarding each torrent's number 
of seeders and leechers alongside the members' IP addresses
(we were able to systematically collect 98% of all the IP
addresses from within the swarms). This study allowed us to
gain an extremely large number of measurements regarding 
details such as peer arrival patterns, seeder/leecher ratios 
and torrent sizes. This information can subsequently be cor- 
related with our smaller-scale microscopic measurements 
to derive such things as the scale of seedless states and 
the causes for seedless states occurring. Our final
macroscopic dataset consisted of reports from 46,227 torrents and
29,066,139 users.

¹ The largest BitTorrent community based on Alexa ranking.

4 Characterising Unavailability: Causes and 
Impact 

In this section, we first investigate the role that seeders 
play in file unavailability. Following this we study the ex- 
ceptions and variations we discovered. Lastly, we then in- 
vestigate the real-time impact that a lack of seeders has on 
client performance and their subsequent reactions that can 
be observed. 

4.1 Investigating the Role of Seeders in 
File Unavailability 

It is intuitive to think that U(T) < 1 in a torrent with- 
out any seeder (that is, leechers are unable to reconstruct 
the file). However, this is, so far, an unverified assumption 
that must be investigated (and quantified). To ascertain this, 
we inspect the (i) nodes' bitfield and (ii) nodes' download 
rates in all the torrents of our microscopic traces affected by 
seedless states. 

4.1.1 Bitfield Analysis 

We have collected every node's bitfield for all the torrents
in our microscopic measurements as they have evolved over
time. For each torrent we have computed U(T)
every 10 minutes during any period in which the torrent is without any
seeders (i.e. it is in a seedless state). This allows us to
ascertain whether a full copy of the file exists in the torrent at any
given time. Fig. 1 shows the CDF of max(U(T)) observed
in the seedless state for each torrent that we studied. From
this data, we can extract two pieces of information. First,
in the majority of cases (86%) our hypothesis is confirmed
and the contacted leechers are unable to collectively
reconstruct the file once a seeder has left (i.e. max(U(T)) < 1).
Clearly, this means that seeders do have a significant
impact on the availability of files in BitTorrent. Second, and
importantly, we also find that a notable proportion of torrents
(14%) actually remain available even without a seeder.
Collectively, these torrents account for 24% of all leechers that operate
in seedless swarms. This is a crucial finding that has not
been observed before; it is therefore in contrast with
previous models [13] [7] that consider all seedless torrents to be
unavailable.
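
The per-torrent statistic described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' analysis code: the `Snapshot` record, the function names and the toy traces are all assumptions. Each snapshot stands for one 10-minute observation of a torrent.

```python
from typing import NamedTuple, Optional, Sequence


class Snapshot(NamedTuple):
    """One periodic observation of a torrent (hypothetical record layout)."""
    seeders: int
    leecher_bitfields: Sequence[Sequence[int]]


def coverage(bitfields: Sequence[Sequence[int]]) -> float:
    """U(T) for one snapshot: share of pieces held by at least one leecher."""
    if not bitfields or not bitfields[0]:
        return 0.0
    num_pieces = len(bitfields[0])
    held = sum(1 for j in range(num_pieces) if any(b[j] for b in bitfields))
    return held / num_pieces


def max_seedless_coverage(snapshots: Sequence[Snapshot]) -> Optional[float]:
    """max(U(T)) over the seedless snapshots, or None if never seedless."""
    values = [coverage(s.leecher_bitfields) for s in snapshots if s.seeders == 0]
    return max(values) if values else None


def is_resilient(snapshots: Sequence[Snapshot]) -> bool:
    """True if a full copy existed among leechers in some seedless snapshot."""
    best = max_seedless_coverage(snapshots)
    return best is not None and best >= 1.0


resilient_trace = [
    Snapshot(1, [[1, 0, 0], [0, 1, 0]]),   # seeded, ignored
    Snapshot(0, [[1, 1, 0], [0, 1, 1]]),   # seedless, all pieces present
    Snapshot(0, [[1, 1, 0]]),              # a leecher left, piece 3 lost
]
susceptible_trace = [
    Snapshot(2, [[1, 0, 0]]),
    Snapshot(0, [[1, 1, 0], [0, 1, 0]]),   # seedless, piece 3 missing
]
print(is_resilient(resilient_trace), is_resilient(susceptible_trace))  # True False
```

Because only reachable peers contribute bitfields, a value below 1 is a lower bound on the true coverage, which is the limitation the download-rate analysis addresses.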

4.1.2 Download Rate Analysis 

A limitation of the bitfield analysis is that not all nodes are
accessible due to NATs. To address this, we also inspect
the aggregate torrent download rates. Through this, we can
infer that a file is unavailable when the download rate of all
the peers participating in a specific torrent drops close to
0 KBps. From this we can derive that the nodes cannot find
any new pieces to download.
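
The inference rule above can be sketched in a few lines of Python. The 1 KBps cut-off used for "close to 0 KBps" and the name `stalled_intervals` are assumptions chosen for illustration; the text does not fix a threshold at this point.

```python
from typing import List, Sequence


def stalled_intervals(
    rates_per_interval: Sequence[Sequence[float]],
    threshold_kbps: float = 1.0,  # assumed cut-off for "close to 0 KBps"
) -> List[int]:
    """Return the indices of sampling intervals in which every observed
    peer downloads below the threshold, suggesting the file is unavailable."""
    stalled = []
    for index, rates in enumerate(rates_per_interval):
        # Intervals without any observed peer carry no evidence either way.
        if rates and all(rate < threshold_kbps for rate in rates):
            stalled.append(index)
    return stalled


# Download rates (KBps) of the online leechers in four 10-minute intervals.
samples = [
    [35.0, 48.2, 12.9],
    [0.4, 0.0, 0.9],
    [0.0, 0.1],
    [20.5, 0.3],
]
print(stalled_intervals(samples))  # [1, 2]
```

The last interval is not flagged because one peer still downloads at 20.5 KBps, so at least some useful pieces are reachable.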

To highlight our findings, we first inspect a representative
torrent from our microscopic trace², shown in Fig. 2.
The figure shows the median instant download rate of the 
online leechers over time, sampled every 10 minutes. It 
also plots the number of seeders and leechers, as well as 
the number of copies of the least replicated piece. Note that 
when the number of seeders becomes 0, the torrent enters a 
seedless state. 

The torrent can be observed to enter a seedless state
after the middle of day 3, remaining in this state for roughly
two days. When the final seed departs, the download rate of
the leechers drops to approximately 0-3 KBps after only a
few minutes. This also coincides with the number of copies
of the least replicated piece dropping to zero. It can therefore be
confidently inferred that the file is, indeed, unavailable during
this period due to the departure of the last seeder.

Interestingly, it can also be seen that the torrent be- 
comes available again during day 5. As the seeders return, 
the download rate increases and the file becomes available 
again. In contrast to past assumptions, it is therefore
evident that unavailability is not continuous. This important
phenomenon will be investigated further in Section 5.3.



The above analysis has inspected a representative
torrent. To validate its widespread applicability we also look
at the download rate degradation in all torrents. To achieve
this, we have taken all the users that have been affected by
a seedless state and separated their downloading time into
two periods: (i) periods in which they have suffered from
a seedless state and (ii) periods in which they have not.
Fig. 3 presents the download rate distribution for both
periods. First, we can observe that the download rate in a
non-seedless state is much higher than in a seedless state.
80-85% of the nodes experience an average download rate
lower than 1 KBps when in a seedless torrent, indicating
that the peers cannot locate any required pieces and the file
is, indeed, unavailable. Second, however, we also observe
that 15-20% of users, in fact, maintain a reasonable level
of performance even without any seeders. This can be
attributed to two reasons: (i) the aforementioned 14% of
torrents are capable of reconstructing their file without a seeder
at an average rate of 21.3 KBps; and (ii) newly joined peers
can download the subset of available pieces at an effective
rate. This can be observed in the representative torrent (cf.
Fig. 2): between days 4 and 5 there is a peak in the number
of leechers which results in a short peak in the download
rate as newcomers download the available pieces.

² We have observed the same behaviour in most of the torrents affected
by seedless states.

Figure 2. Snapshot from a torrent in our microscopic trace.

4.2 Investigating the Causes of Swarm 
Resilience 

The previous section has identified a notable percentage
(14%) of torrents that can maintain availability even
without any seeders; this represents 24% of all leechers that
encounter seedless states.

Figure 3. Interval download rates for nodes
affected by the lack of seeders.



To investigate this, we separate torrents into those that 
survive in the absence of seeders (resilient torrents) and
those that do not (susceptible torrents). We then investi- 
gate quantitative properties of these two groups to ascertain 
how they differ at various points in their lifecycles. All of 
the identified metrics have been calculated for each group 
(across all member torrents) every 10 minutes using infor- 
mation from the microscopic traces and the tracker reports. 
These values have then been averaged together over each 
time period investigated. 

Table 1 gives an overview of all metrics used in this
analysis. We calculate these over two time periods: the
beginning of the torrents' lifecycle and just before the last seeder
goes offline. Although not included in the table, we also
investigated the effects of file size and content type
without ascertaining any correlation. Most metrics are
straightforward; however, two require some explanation: the
Distribution Entropy (E(T)) and the Churn Factor (CF).

E(T) captures the distribution of pieces within
the swarm; its purpose is to investigate whether torrents that can
survive achieve a superior distribution of pieces. We
therefore characterize the distribution entropy in a torrent T at a
given time t by introducing the following Entropy Index:



E{T) = 



(2) 



Recall that N defines the number of nodes in the swarm, 
P is the number of pieces a file is composed of and Vi is the 
bitfield of node i. This index is similar to Jain's Fairness 
Index [11] and achieves a value of 1 if all pieces are equally
distributed among the peers. 
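
The exact form of Eq. 2 is not legible in this text, so the following is only a sketch under a stated assumption: that the index applies the standard Jain fairness formula to the per-piece replica counts, i.e. (sum of counts)^2 / (P * sum of squared counts). The function name `distribution_entropy` is hypothetical. The sketch does reproduce the one property the text states, namely a value of 1 when all pieces are equally replicated.

```python
from typing import Sequence


def distribution_entropy(bitfields: Sequence[Sequence[int]]) -> float:
    """Jain-style index over per-piece replica counts (assumed form of E(T)).

    Returns 1.0 when every piece has the same number of copies and tends
    towards 1/P when all copies are concentrated on a single piece.
    """
    if not bitfields or not bitfields[0]:
        raise ValueError("need at least one peer and one piece")
    num_pieces = len(bitfields[0])
    # Replica count of piece j: how many of the N peers hold it.
    replicas = [sum(peer[j] for peer in bitfields) for j in range(num_pieces)]
    total = sum(replicas)
    squares = sum(count * count for count in replicas)
    if squares == 0:
        return 0.0  # nobody holds anything; the index is undefined, report 0
    return (total * total) / (num_pieces * squares)


balanced = [[1, 1], [1, 1]]               # every piece has two copies
skewed = [[1, 1, 1], [1, 0, 0], [1, 0, 0]]  # piece 1 has three copies
print(distribution_entropy(balanced))  # 1.0
```

For the skewed swarm the replica counts are 3, 1 and 1, giving 25 / 33, noticeably below 1.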

The Churn Factor CF investigates whether torrents that
can survive have more stable populations. This factor is
defined by $N_{disc}/N_{all}$, where $N_{disc}$ is the number of users
that have left the swarm during a given time period (t) and
$N_{all}$ is the total number of users observed during this same
period. A factor of 0 indicates that no user disconnected
within t; by default t = 10 mins.
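
A minimal sketch of the Churn Factor follows. It assumes that peers are identified by some stable key (here plain strings) and that the users "observed" during a period are known together with those still online at its end; the function name and argument layout are illustrative assumptions.

```python
from typing import Set


def churn_factor(observed: Set[str], online_at_end: Set[str]) -> float:
    """CF = N_disc / N_all for one period t.

    observed:       all users seen at any point during the period (N_all)
    online_at_end:  users still connected when the period ends
    """
    if not observed:
        return 0.0  # empty swarm: nobody could disconnect
    disconnected = observed - online_at_end  # users that left during t
    return len(disconnected) / len(observed)


seen = {"peer-a", "peer-b", "peer-c", "peer-d"}
still_online = {"peer-a", "peer-b", "peer-c"}
print(churn_factor(seen, still_online))  # 0.25
```

Averaging this value over consecutive 10-minute periods yields one churn figure per torrent group.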
                                      Time before seedless state          Time after torrent's birth
                                      1 hour            6 hours           6 hours           24 hours
Metric                                Res.     Susc.    Res.     Susc.    Res.     Susc.    Res.     Susc.
Swarm speed (in KBps)                 58.83    23.88    68.58    24.99    95.72    53.50    62.99    44.82
Seeder/Leecher ratio                  0.15     0.03     0.14     0.04     0.19     0.19     0.32     0.43
Firewalled/NATed peers (in %)         70.86    61.09    62.30    60.67    51.30    56.05    54.94    58.44
Distribution Entropy E(T)             0.93     0.94     0.92     0.93     0.91     0.90     0.92     0.91
Least replicated piece (# of copies)  9.21     1.61     8.38     2.82     15.34    14.21    21.33    23.55
Churn factor CF                       0.03     0.21     0.08     0.15     0.07     0.11     0.08     0.07
Online leechers                       281.34   111.61   250.48   101.55   134.05   82.71    163.12   89.55
Online seeders                        5.28     1.51     6.11     1.80     12.05    11.25    18.17    21.34

(Res. = resilient torrents, Susc. = susceptible torrents)

Table 1. Characteristics of resilient torrents (those that maintain availability in seedless state) and 
susceptible torrents (those that cannot reconstruct the file). 



From the data in Table 1 we can make the following
important observations:

• Torrent Popularity: From the beginning, resilient tor- 
rents exhibit higher leecher population sizes. Larger 
torrents possess an increased probability of replicating 
rare pieces before the loss of seeders. 

• Low Churn Factor: High churn in small torrents
creates a greater risk of losing vital pieces; if this
coincides with the loss of a seeder then it becomes
impossible to recover these pieces again until a seeder
returns. Resilient torrents have significantly lower churn
factors than susceptible torrents.

• Seeder/Leecher Ratio: Resilient torrents exhibit a
higher seeder/leecher ratio and, as a derivative of this,
experience download rates that are over twice as high
as those of susceptible torrents. This superior performance is
highly beneficial for the survival of piece replicas as
it allows the quick duplication of rare pieces.
Before a seedless state occurs, resilient torrents
therefore have many more replicas of the rarest piece when
compared to susceptible torrents.

In summary, these results show that swarm resilience is a 
product of large, stable populations that can achieve higher 
download rates due to beneficial seeder/leecher ratios. The 
combination of these factors results in rarest piece replica- 
tion rates that are over 5 times greater than their suscepti- 
ble counterparts. This makes such swarms highly resilient 
to the loss of any seeders. Importantly, it can also be
concluded that unavailability cannot be addressed by modifying
any of BitTorrent's algorithms (e.g. piece selection) but,
instead, must be solved by incentivising users to modify their
behaviour. This is exemplified by the lack of any correlation
between resilience and distribution entropy.

4.3 Effects and Trends of Seed Departure 

Despite a notable percentage of torrents surviving
without seeders, it is evident that the loss of all seeders can often
result in unavailability. This section now investigates the
effects that this has on both individual users and wider system
performance. Three stages can be identified, which we now
discuss.

Figure 4. Comparison of download performance between peers
affected and not affected by seedless states.

The first repercussion of the loss of seeders in suscepti- 
ble torrents is a significant and rapid drop in download rates. 
To extend the earlier analysis, we now compare the average 
download rate of users that suffer from a seedless state at 
some point during their download against the average down- 
load rate of users that always find content available. We first 
categorise users into two groups: affected vs. non-affected. 
The first group of users consists of leechers that are (at some 
point) affected by a lack of seeders. The non-affected users, 
on the other hand, have at least one seeder available
during their entire download. Fig. 4 gives the download rate
distribution of both user groups as obtained from the two
microscopic crawls. Whereas the median download rate
for the non-affected users is 36 KBps in micros-1 and 48
KBps in micros-2, the performance for peers attempting
to download unavailable content is only 0.06 KBps and 3.8
KBps, respectively.

The second observable stage is a direct derivative of the
decrease in download performance. Specifically, we
observe a large increase in download abortions. To study this
we examine the session times in our microscopic traces. We
observe that 89% of users affected by file unavailability (i.e.
participating in susceptible torrents) abort their downloads
due to the bad performance. Sadly, this is an unnecessary
action as we have found that seeders often return, making 
files available again. In contrast to these results, users op- 
erating in resilient torrents only have an abortion rate of 
34.47%. Although this seems initially high, we also find 
that many users operating in other torrents that do not suffer 
from unavailability also abort their downloads. On closer 
inspection, these 'unnecessary' abortions occur in torrents 
that have particularly low download rates that are under a 
third of the average. 

The third stage in this process is the worrying emergence
of a chain reaction. We find that as the number of abortions
increases, the number of available chunks decreases. This
results in an exacerbation of the torrent's unavailability and a
further drop in download rates for those trying to access the
remaining chunks. As other users witness this trend, they
too abort their downloads. This process results in fewer
users becoming seeders and therefore greater
unavailability and more abortions. Frequently, the combination of the
above two repercussions of unavailability and this chain reaction
spells the end for a torrent.

From these findings we derive that users are highly sen- 
sitive to their perceived instant quality of service and there- 
fore any solutions must maintain an acceptable download 
rate whilst also improving file availability. 

5 Characterising Seedless States 

The previous section has validated and quantified the im- 
portance of seeders in regard to file availability in BitTorrent 
and discussed under which circumstances a file of a seedless 
torrent becomes unavailable. It has been found that in the 
majority of torrents (86%), the loss of all seeders results in 
unavailability. In this section we therefore investigate the 
behaviour of seeders and characterise the nature of seed- 
less states using our large scale dataset. We first look at the 
frequency of seedless states in BitTorrent. Following this 
we investigate the causes of seedless states before, finally, 
investigating the issue of why torrents can become revived 
again after extended periods of unavailability. 

5.1 How Prevalent are Seedless States? 

To quantify how prevalent seedless states are in
BitTorrent, we ask the following question: how many torrents
are affected by seedless states, and to what extent? To
answer this, we use the logs from our macroscopic trace that
give us a large-scale view of the system comprising 46K
torrents.



Figure 5. Illustration of a seedless state (regions labelled:
Availability Period, Unavailability Period).

The measurements show that more than 38% of torrents 
(17,568 out of 46,227) lose their seeders within the first 
month, out of which 72% lack seeders after only 5 days. 
Similarly, we find that more than 45% of the torrents suffer 
from a lack of seeders for half of their monitoring time. To 
exemplify the scale of this, in 50% of the torrents observed 
for periods longer than 30 days, no seeder was available for 
more than 16 days. 
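
To make the prevalence metrics concrete, the sketch below computes, from periodic tracker samples, the fraction of a torrent's monitoring time spent without seeders and the share of torrents that are seedless for at least half of it. The dictionary layout, torrent names and function names are hypothetical, and evenly spaced samples are assumed so that a sample count can stand in for time.

```python
from typing import Dict, Sequence


def seedless_fraction(seeder_counts: Sequence[int]) -> float:
    """Share of (evenly spaced) tracker samples that report zero seeders."""
    if not seeder_counts:
        raise ValueError("torrent has no samples")
    return sum(1 for count in seeder_counts if count == 0) / len(seeder_counts)


def share_mostly_seedless(
    traces: Dict[str, Sequence[int]], min_fraction: float = 0.5
) -> float:
    """Share of torrents that are seedless for at least min_fraction of
    their monitoring time."""
    flagged = sum(
        1 for counts in traces.values()
        if seedless_fraction(counts) >= min_fraction
    )
    return flagged / len(traces)


# Hypothetical seeder counts reported by the tracker over six samples.
traces = {
    "torrent-a": [3, 2, 0, 0, 0, 0],
    "torrent-b": [5, 4, 4, 3, 2, 1],
    "torrent-c": [1, 0, 1, 0, 1, 0],
    "torrent-d": [2, 2, 1, 1, 0, 1],
}
print(share_mostly_seedless(traces))  # 0.5
```

The same pass over real tracker logs would also give the simpler headline ratio quoted above, since 17,568 / 46,227 is roughly 0.38.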

Finally, in our study, more than 9.68 million users (33% 
of all users seen) participated in torrents with highly un- 
available seeders suggesting that this is not only a long tail 
problem. Out of these users, more than 1.59 million were 
directly affected by seedless states. 

5.2 Why do Seedless States Occur? 

Since seedless states are highly prevalent in real swarms, 
an intuitive question is: why do they occur in the wild? In 
this section, we first identify and then further investigate 
the influencing factors responsible for triggering seedless 
states. 

5.2.1 Identifying Influencing Factors 

There are two main factors that directly influence the existence
of seedless states: (i) the session time of seeders and
(ii) the inter-arrival rate of the users. To illustrate the
influencing factors, we use a simple example shown in Fig. 5. In
this figure, each horizontal line represents the lifetime of a
user; these users can either be in a leecher state (thin lines)
or a seeding state (thick lines).

It seems straightforward that the longer a seeder serves
content, the more leechers are able to finish their downloads.
Unfortunately, as demonstrated later on, the seeding
time is typically quite short, which contributes significantly
to the frequency and length of seedless states.

Let us now assume that user n is the last available seeder
in our example torrent and that none of the previous seeders
return to the torrent. In this case, a seedless state occurs when
the time required for leechers to download the file exceeds
the online time of the last seeder. For example, Fig. 5 shows
that after the last available seeder leaves the swarm at time
$t_s$, none of the remaining leechers were able to finish the
download. If we focus on the n-th node and its successor in
the torrent (the (n+1)-th node), the inter-arrival time between
both users is given by $\tau_{n+1} (= t_2 - t_1)$, whereas the
seeding time of node n is given by $\mu_n$. Assume that users
n and n+1 download a file of size $F_s$ with rates $D_n$ and
$D_{n+1}$, respectively. Thus, the swarm enters a seedless state
when Eq. (3) is fulfilled.

$$\frac{F_s}{D_n} + \mu_n < \tau_{n+1} + \frac{F_s}{D_{n+1}} \qquad (3)$$

To simplify the analysis, we assume that $D_n = D_{n+1}$.
In this case, the seedless state is reached if the inter-arrival
time is larger than the seeding time.
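
As a small illustration of Eq. (3), the following Python sketch checks whether a swarm with a single remaining seeder becomes seedless. The function name, the units (KB, KBps, seconds) and the example values are hypothetical and are not taken from our traces.

```python
def enters_seedless_state(file_size, rate_n, rate_next, seeding_time, inter_arrival):
    """Return True if Eq. (3) holds, i.e. seeder n leaves before node n+1 finishes.

    file_size:     F_s, size of the file (KB)
    rate_n:        D_n, download rate of node n (KBps)
    rate_next:     D_{n+1}, download rate of node n+1 (KBps)
    seeding_time:  mu_n, time node n keeps seeding after finishing (s)
    inter_arrival: tau_{n+1}, time between the arrivals of n and n+1 (s)
    """
    # Both sides are measured relative to the arrival time of node n.
    departure_of_seeder = file_size / rate_n + seeding_time
    completion_of_next = inter_arrival + file_size / rate_next
    return departure_of_seeder < completion_of_next


# Hypothetical example: 700 MB file, both nodes download at 100 KBps.
size_kb = 700 * 1024
print(enters_seedless_state(size_kb, 100, 100, 4 * 3600, 10 * 3600))   # True
print(enters_seedless_state(size_kb, 100, 100, 12 * 3600, 10 * 3600))  # False
```

With equal download rates the file-size terms cancel, so the outcome depends only on whether the inter-arrival time exceeds the seeding time, as stated above.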

To summarise, seeding times as well as inter-arrival 
times play an important role in the generation of seedless 
states and subsequently in the long-term availability of con- 
tent. Since both parameters are not directly correlated, we 
individually analyse both of them in the following. 

5.2.2 Arrival Behaviour of Users 

The first behavioural characteristic that is paramount to 
seedless state generation is the inter-arrival times of users. 
In this regard, intuitive questions are: (i) what inter-arrival 
times do we expect in reality and (ii) how do inter-arrival 
times evolve over time? 

By analysing a few hundred torrents in a small community,
previous work [7] has shown that user inter-arrival
times increase exponentially. Our goal is to generalize
this finding to 'open' communities such as Mininova.org
that are orders of magnitude larger. For our analysis, we use
techniques similar to those applied in [7]. We consider all
torrents in our macroscopic trace. We use linear regression to
fit the logarithm of the complementary number of node arrivals
of each torrent over time. Let $X_t$ denote the complementary
number of node arrivals at time epoch t and $Y_t$ be
the fitting result. We define the relative deviation of the
actual node arrivals from an ideal exponential function by
$\frac{|\log X_t - \log Y_t|}{\log X_t}$. Thus, a relative
deviation of 0% indicates that both curves overlap. Fig. 6
shows the deviation for each torrent of our macroscopic trace.
The x-axis depicts the torrents ordered by ascending population
size while the y-axis shows the relative deviation. For most of
the torrents, the relative deviation is less than 10% whereas

^Our microscopic measurements show that the download rate of users
that finish downloads ($D_n$ in the example) is higher than the download rate
of those that do not ($D_{n+1}$), validating our assumption.

^We use the complementary number of node arrivals to avoid domains
in which the logarithm is undefined, e.g., epochs with no peer arrivals.

Figure 6. Deviation from linear regression.

Figure 7. Maximum inter-arrival times of torrents with highly unavailable seeders.

the deviation tends to decrease with increasing torrent pop- 
ularity. Altogether, the average relative deviation of all tor- 
rents is 4.8%. Therefore, we conclude that the inter-arrival 
time of the nodes exponentially increases with time. 
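
The fitting procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "complementary number of node arrivals" is interpreted here as the number of users that have not yet arrived before epoch t, epochs with a single remaining user are skipped so that the denominator is non-zero, and the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_relative_deviation(arrivals_per_epoch):
    """Average of |log X_t - log Y_t| / log X_t over all usable epochs."""
    total = sum(arrivals_per_epoch)
    # Assumed reading of X_t: users that have not arrived before epoch t.
    remaining, seen = [], 0
    for count in arrivals_per_epoch:
        remaining.append(total - seen)
        seen += count
    # Keep epochs with X_t > 1 so that log X_t > 0.
    points = [(t, math.log(x)) for t, x in enumerate(remaining) if x > 1]
    a, b = fit_line([t for t, _ in points], [lx for _, lx in points])
    # a + b * t is the fitted value of log Y_t.
    deviations = [abs(lx - (a + b * t)) / lx for t, lx in points]
    return sum(deviations) / len(deviations)


# Hypothetical torrent whose arrivals halve every epoch (ideal exponential case).
print(mean_relative_deviation([32, 16, 8, 4, 2, 2]))
# Hypothetical torrent with a constant arrival rate (not exponential).
print(mean_relative_deviation([1, 1, 1, 1, 1, 1, 1, 1]))
```

The first call yields a deviation that is zero up to floating-point error, whereas the constant-rate torrent deviates noticeably from the fitted line.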

Notably, we observed especially high inter-arrival times
in torrents affected by seedless states; this is in line with our
analysis in the previous section. For instance, Fig. 7 plots
the maximum inter-arrival time observed in these torrents
with unavailable seeders. More than 45% of the torrents
exhibit inter-arrival times far beyond 10 hours.

5.2.3 Seeding Times of Users 

The second behavioural characteristic that is paramount to
the creation of seedless states is the seeding time of a node,
i.e. how long seeds stay online. As already shown in
our example torrent (cf. Fig. 5), to maintain file availability
it is necessary for seeders to remain online long enough
for new seeds to be generated. Fig. 8 shows the cumulative
distribution of the seeding times of the nodes obtained
from the two microscopic measurements. It can be seen
that seeding times are generally short, with 75% of
the seeders staying online for less than 4 hours. When this
data is compared to the inter-arrival times of users, it becomes
clear that the current seeding times in BitTorrent are not
sufficient to avoid seedless states, which prevents
long-term file availability in BitTorrent.

5.3 How long are Seedless States? 

The representative snapshot presented in Fig. 2 has highlighted
that torrents can become available again after extended
periods of unavailability. In this section we validate
(using both the macroscopic and microscopic datasets) that
file unavailability is, in fact, discontinuous, with recurring
periods of temporary availability. Our measurement studies
show that this occurs because seeders often return to swarms
that they have previously participated in. This allows the
11% of users that choose to remain online during periods of
unavailability (i.e. in susceptible torrents) to eventually
complete their downloads. Alongside the existence of resilient
torrents, this means that 23.5% of all leechers affected by
seedless states can actually still gain access to the file.

Figure 8. Seeding time distribution (x-axis: session seeding time in hours).

The reoccurrence of seeders happens in over 64% of torrents
that suffer from seedless states in our macroscopic
study. To investigate this, Fig. 9 shows the CDFs of both
the duration of seedless states and the duration of
the subsequent periods in which content becomes available
again, computed over all torrents exhibiting this phenomenon.
Note that the x-axis is in log scale. It can be observed
that seedless periods are typically long-lasting, with
an average of 43.19 hours, whereas the subsequent availability
periods only last 12.56 hours on average.
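
To make the notion of availability cycles concrete, the following Python sketch splits a series of tracker snapshots into alternating periods. It is a simplification under stated assumptions: the presence of at least one seeder is used as a proxy for availability, snapshots are taken at a fixed (hypothetical) interval, and the function names and sample data are illustrative only.

```python
from itertools import groupby


def availability_cycles(seeder_counts, interval_hours=1.0):
    """Split per-snapshot seeder counts into alternating periods.

    Returns a list of (state, duration_hours) tuples, where state is
    "available" (at least one seeder) or "seedless" (no seeder).
    """
    cycles = []
    for has_seeder, run in groupby(seeder_counts, key=lambda c: c > 0):
        length = sum(1 for _ in run)
        state = "available" if has_seeder else "seedless"
        cycles.append((state, length * interval_hours))
    return cycles


def mean_duration(cycles, state):
    """Average duration of all periods in the given state (0.0 if none)."""
    durations = [d for s, d in cycles if s == state]
    return sum(durations) / len(durations) if durations else 0.0


# Hypothetical hourly snapshots of one torrent.
snapshots = [2, 1, 0, 0, 0, 0, 1, 0, 0, 3]
cycles = availability_cycles(snapshots)
print(cycles)
print(mean_duration(cycles, "seedless"))  # 3.0
```

Collecting these durations over all torrents would give the two distributions whose CDFs are plotted in Fig. 9.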

The primary reason for the (seemingly) altruistic return
of seeders is likely to be the default settings of many BitTorrent
clients (e.g. Vuze, µTorrent), which automatically rejoin
torrents at start-up even after a user has completed a
file download. Unfortunately, BitTorrent users do not have
permanent identifiers and thus we cannot make quantitative
statements on exactly how many unique seeders rejoin a
swarm and over what time period. However, the length of
the seedless periods as depicted in Fig. 9 offers a conservative
bound for the inter-seeding time distribution of such
users. This is obviously very coarse-grained, and we therefore
expect the actual inter-seeding times to be higher.

The reoccurrence of seeders is in clear contrast to
previous work that has assumed unavailability is continuous
and immutable. To investigate the impact that this finding
has on previous work, we briefly look at the relative deviation
that this assumption introduces when compared against our
dataset. We define the relative deviation as

$$\text{relative deviation} = \frac{\text{measured avail. time} - \text{assumed avail. time}}{\text{assumed avail. time}} \qquad (4)$$

Figure 9. Length of cycles of temporary availability and unavailability periods.

We find that in approximately 35% of the torrents in our
macroscopic dataset, the assumption works well. In these
torrents, we did not observe any temporary availability period
after the torrent first enters a seedless state. However,
in 50% of the torrents, the content is actually available for
at least twice as long as assumed when considering immutable
unavailability.
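
For illustration, Eq. (4) can be evaluated as follows. The numbers are hypothetical; the assumed availability time is taken to be the time until the first seedless state, and the measured availability time additionally includes later availability periods.

```python
def relative_deviation(measured_hours, assumed_hours):
    """Eq. (4): deviation of the measured availability time from the assumed one."""
    return (measured_hours - assumed_hours) / assumed_hours


# Hypothetical torrent: the first seedless state begins after 24 h, but
# returning seeders later add another 30 h of availability.
assumed = 24.0
measured = 24.0 + 30.0
print(relative_deviation(measured, assumed))  # 1.25

# A torrent without any later availability period matches the assumption.
print(relative_deviation(24.0, 24.0))  # 0.0
```

A relative deviation of 1 or more corresponds to content that is available for at least twice as long as the immutable-unavailability assumption predicts.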

6 Improving File Availability 

The previous sections have outlined the file availability
problem and highlighted the significant impact that seeders
have on it. We therefore deduce that a solution must find
some way to encourage users to provide content even after
they have obtained it themselves, i.e. to prevent seeders
from leaving torrents. To do this we highlight two possible
approaches: single-torrent and cross-torrent mechanisms.
The principles of these two approaches are first outlined
abstractly to show how each might improve seeding times.
Following this, the two approaches are evaluated in light of
the key findings described in the previous sections (e.g.
reoccurrence of seeders). For this purpose, we run trace-based
simulations using the workload from our large-scale dataset.

6.1 Potential Solution Approaches for Extending Seeding Times

This section briefly outlines the two generic approaches
that can be taken for improving seeding in BitTorrent. The
first uses traditional single-torrent principles whilst the
second exploits the concept of cross-torrent collaboration
(originally outlined in [7]). Note that we do not offer concrete
implementational details; instead, we provide a brief
outline of the principles behind each mechanism.

6.1.1 Single-Torrent Solution

A single-torrent solution involves incentivising users to
remain within a torrent to seed, based on certain properties
related to that individual torrent. As yet, we do not know
of any successful mechanism to achieve this, due to the
difficulty of enforcing incentives once a peer has already
obtained the file it desires. We therefore consider a simple
framework of encrypted pieces that may work. Such a
solution would involve encrypting the file before it is
distributed within the swarm. The tracker would be responsible
for managing this encryption and, as such, would be the
source of the keys. Subsequently, once a peer has downloaded
the file, it would be required to remain seeding for a
length of time determined by the tracker before the encryption
keys are released to it.

6.1.2 Cross-Torrent Solution

A cross-torrent solution involves incentivising users to
cooperate with the system as opposed to individual torrents.
This approach is motivated by observations from our
macroscopic trace, which show that 51% of the users join
multiple torrents (4.98 on average). We have further found that
seeders frequently rejoin swarms after they have left, therefore
providing conclusive evidence that the same peers rejoin
the BitTorrent system multiple times whilst still possessing
their previously downloaded files. To highlight the
principles of a cross-torrent solution, imagine a user who
joins torrent X at some point in time and completes the
download; this user may very well join another torrent Y
at a later point in time. When the node comes online again
to download torrent Y, it could then theoretically persist as
a replica for torrent X.

The incentives behind this could be managed in a number
of ways (e.g. [17]). The following example highlights how
the system could work using persistent contribution histories.
Through this approach, the system would maintain
a history of the contributions made by each user (agnostic
to which torrent the contribution is made). Subsequently,
peers would show preference to piece requests from users
with higher contribution ratios. This would therefore replace
BitTorrent's current rate-based tit-for-tat mechanism
so that incentives were based on the entire system as opposed
to individual users and torrents.

6.2 Experimental Methodology

To evaluate the two possible solution approaches, the
BitTorrent simulator of Bharambe et al. [2] is used and
extended to enable the simulation of multiple torrents existing
in parallel.

6.2.1 Evaluative Aims


We do not aim to perform an implementational comparison
between vanilla BitTorrent and the proposed approaches,
e.g., regarding protocol overhead and the technical aspects of
realising either approach. This is out of the scope of this
paper. The goal of our evaluation is to shed light on the
feasibility and potential of the two approaches based on the
newly discovered observations from our studies.

For both approaches we wish to discover (i) whether the
approach increases file availability in torrents with ordinarily
unavailable seeders, and (ii) what the implications of
this are in regard to download performance. We aim to
investigate these factors on both a per-torrent and system-wide
basis to explore how the approaches affect
both perspectives on BitTorrent.

6.2.2 Input to the experiments 

Selecting the Torrents: Our trace data encompasses tens
of thousands of torrents over a period of several weeks, far
more than the simulator is able to handle. Hence, we chose
a random subset of 100 torrents from the set of torrents
affected by seedless states, with file sizes varying between 3 and
1500 MB and a per-torrent monitoring period of at least four
weeks. The logs of these torrents contain data on more
than 235,000 downloads.

User behavior: To model the access pattern of torrents, we
do not use any artificial peer arrival function. Instead, we
bring up new peers as well as reoccurring seeders according
to the trace logs. To model the number of swarms that
a peer joins, we calculate the probability distribution over
our entire data set. Any user that cannot download the file
within 36 hours aborts the download. Finally, after finishing
their downloads, users stay online as seeders based
on the measurements from our microscopic crawls (cf.
Fig. 8).

Speed distributions: To obtain a representative bandwidth
distribution, we first associate each IP address with a
country, using a freely available geolocation database [11].
Based on the country of origin, the Ookla database [16]
provides us with the median down/uplink capacity of each
user.

Failures in contribution histories: To represent informa- 
tion inconsistencies in the distribution of contribution his- 
tories (e.g. due to churn), when encountering a new user in 
the cross-torrent approach, the contribution history is only 

^We have also experimented with larger and smaller numbers of torrents.
Due to space constraints, we opt for presenting only a representative
sample.

^We find through simulations that 36 hours is enough time to achieve a
download success ratio of over 99% in the presence of seeders for all access
links and file sizes used in our experiments.

^We have also experimented with other datasets [9], [4] and obtained
similar results.

Protocol                 Avg. seeding time   $D_s$       $F_s$
                         (in hours)          (in KBps)   (in %)
BT: Vanilla              3.44                137.84      20.25
ST: 2x seeding           6.88                158.11      13.65
ST: 5x seeding           17.20               179.41      4.39
ST: 10x seeding          34.40               190.81      0.66
CT: Persistent history   3.44                138.75      0.13

Table 2. Overview of system-level results.

known with a probability of 0.9. This represents a worst-case
scenario, as the literature has reported a superior accuracy
of 0.96 [17].
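
The cross-torrent preference rule combined with imperfect history lookups can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the function names are hypothetical, and treating a user whose history is unknown like a newcomer with a ratio of zero is an assumption made here for simplicity.

```python
import random


def contribution_ratio(uploaded, downloaded):
    """System-wide upload/download ratio; users that only uploaded rank highest."""
    return uploaded / downloaded if downloaded > 0 else float("inf")


def rank_requests(requesters, histories, known_probability=0.9, rng=random):
    """Order piece requests by the requester's contribution ratio.

    histories maps a user to (uploaded, downloaded) across all torrents.
    """
    scored = []
    for user in requesters:
        history = histories.get(user)
        # With probability 1 - known_probability the lookup fails; the user
        # is then treated like a newcomer (assumption: ratio 0).
        if history is None or rng.random() >= known_probability:
            ratio = 0.0
        else:
            ratio = contribution_ratio(*history)
        scored.append((ratio, user))
    # Highest ratio first; ties are broken by user name for determinism.
    return [user for _, user in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical histories in MB; known_probability=1.0 makes the run deterministic.
histories = {"a": (100, 50), "b": (10, 100), "c": (500, 100)}
print(rank_requests(["a", "b", "c", "d"], histories, known_probability=1.0))
# ['c', 'a', 'b', 'd']
```

With the default probability of 0.9, roughly one lookup in ten fails, so a user with a good history is occasionally served as if it were unknown.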

6.2.3 Performance metrics 

We utilise two performance metrics to evaluate the
effectiveness of the approaches. The first is the average
downloading rate of successful users (D) and the second is the
fraction of download abortions (F). Both metrics are
calculated on a per-torrent ($D_t$, $F_t$) and system-wide basis
($D_s$, $F_s$).
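
The two metrics can be computed from a list of download records as sketched below. The record layout and the sample values are hypothetical and only serve to show how the per-torrent and system-wide variants differ.

```python
def download_metrics(downloads):
    """downloads: list of (torrent_id, completed, rate_kbps) tuples.

    Returns (per_torrent, system), where per_torrent maps torrent_id to
    (D_t, F_t) and system is (D_s, F_s). D is the mean rate of completed
    downloads in KBps (None if there are none); F is the aborted fraction.
    """
    def summarise(records):
        rates = [rate for done, rate in records if done]
        aborted = sum(1 for done, _ in records if not done)
        mean_rate = sum(rates) / len(rates) if rates else None
        return mean_rate, aborted / len(records)

    by_torrent = {}
    for torrent_id, completed, rate in downloads:
        by_torrent.setdefault(torrent_id, []).append((completed, rate))
    per_torrent = {t: summarise(r) for t, r in by_torrent.items()}
    system = summarise([(c, r) for _, c, r in downloads])
    return per_torrent, system


# Hypothetical records: a popular torrent A and a small torrent B.
records = [
    ("A", True, 200), ("A", True, 100), ("A", True, 150), ("A", False, 0),
    ("B", True, 50), ("B", False, 0),
]
per_torrent, system = download_metrics(records)
print(per_torrent)  # {'A': (150.0, 0.25), 'B': (50.0, 0.5)}
print(system[0])    # 125.0
```

In this example the system-wide rate (125 KBps) is higher than the mean of the per-torrent rates (100 KBps) because popular torrents contribute more downloads to the system-wide average.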

6.3 Comparative Results 

Table 2 gives an overview of the system-level results, including
three single-torrent variants in which the measured seeding
times are lengthened by a factor of either 2, 5, or 10. For
comparability, we assumed for the cross-torrent approach that
users remain online after downloading for as long as they do in
vanilla BitTorrent. The results presented in this table summarize
the fraction of download abortions ($F_s$) and the average
downloading rate ($D_s$) on a system-wide level. Some points
worth noting:

• In the chosen set of torrents, 20% of downloads were
not successful in vanilla BitTorrent.

• To maintain persistent file availability in the single-torrent
approach, e.g. to ensure a system-wide download success
rate above 99%, users must stay on average
10 times longer after downloading. This induces
an average seeding time of more than 34 hours.

• The cross-torrent approach achieves a download
failure ratio similar to that of the single-torrent variant
that lengthens the seeding times by a factor of 10. However,
this is achieved without having to increase the
seeding times beyond those currently observed in BitTorrent.
With regard to download rates, the cross-torrent
approach also performs similarly to vanilla BitTorrent.



Figure 10. Overview of the per-torrent abortion ratios (x-axis: torrents; legend includes BT: Vanilla and ST: 10x seeding).

Figure 11. Average download rates on a per-torrent basis.

In addition to this table, Fig. 10 and Fig. 11 separately
plot the fraction of aborted downloads and the average
downloading rate, respectively, on a per-torrent basis. It can
be observed that, in a few torrents, even a tenfold increase
in seeding times has only a marginal effect on reducing
the abortion ratio. In contrast, the cross-torrent
approach benefits from the available file replicas,
as its abortion ratio never exceeds the 2% threshold.

When examining the download performance on a per-torrent
basis in Fig. 11, it can be observed that users applying
the cross-torrent solution actually increase their download
rate when compared to vanilla BitTorrent. For instance,
the average downloading rate ($D_t$) over all torrents
is 102.26 KBps for vanilla BitTorrent and 121.65 KBps for
the cross-torrent variant. As this is an average, it does not
highlight the disparities between different torrents'
performance based on popularity. Whilst not shown in the plots,
the cross-torrent approach reallocates upload capacity from
particularly popular torrents to other torrents. Therefore,
43.55% of the users that finish in vanilla BitTorrent gain a
performance increase of 88% when applying cross-torrent
collaboration. However, the remaining 56.44% of these
users suffer from an average performance decrease of 22%.

6.4 Summary

To conclude, even when considering the seeding discontinuity
of users, the single-torrent approach emerges as
highly impracticable: to ensure a system-wide file availability
of >99%, average seeding times of more than 34
hours are required. In contrast, the cross-torrent approach
using persistent histories achieves this level of file availability
easily. It therefore allows the 20% of users that could
not complete their downloads in vanilla BitTorrent to
effectively download the file. The download performance of
the cross-torrent approach is, on a system level, equivalent
to vanilla BitTorrent, and even improves download times on
a per-torrent basis. Although this finding is at first glance
surprising, the cross-torrent approach benefits from the
significant increase in nodes (>20%) which now find content
available. This, in turn, allows users that previously could
not access the content to download at high rates; this
compensates for the inherent performance disadvantages of peer
selection policies that are optimised for fairness [6].

However, it must also be noted that the increase in file
availability due to cross-torrent collaboration comes with
a notable trade-off. That is, the download rate of more than
half of the users that finish downloads with vanilla BitTorrent
degrades by an average of 22%. This can be contrasted
with an 88% improvement for the remainder of peers.

7 Conclusions

This paper has investigated BitTorrent's unavailability
problem in the wild and explored the feasibility of the
potential solution-space. To achieve this, two large-scale
measurement studies were performed to ascertain the
characteristics, causes and repercussions of file unavailability in
BitTorrent. Based on this, we made a number of interesting
findings that offer the most accurate study of file availability
in BitTorrent so far. Most notably, it was found that (i)
a lack of seeders often, but not always, results in unavailability,
(ii) the churn level, the fast replication of rare chunks
and the population size largely define a swarm's ability to
survive without a seeder, (iii) unavailability usually occurs
in cyclic periods with intermittent availability, and (iv)
unavailability often results in a chain effect that leads to future
download failures.

Due to these new findings, the solution-space was also
investigated to see how they affect both single-torrent and
cross-torrent solutions. It was found that the continuance of
BitTorrent's single-torrent mechanisms can only address the
problem with a tenfold increase in seeding times. In contrast,
great potential has been found in using the cross-torrent
approach, which maintains current performance levels whilst
also achieving over 99% availability.

References